
[quantization] Introduce a script for LLM evaluation #467

Merged
mhs4670go merged 1 commit into Samsung:main from stamalakhov:lm_eval on Feb 9, 2026

Conversation

@stamalakhov (Contributor) commented on Feb 5, 2026:

This PR introduces an option to run many LLM-related tasks using the lm_eval package.

To use it, please make sure you have lm_eval installed:

pip install lm-eval

The PR makes it possible to rank quantization results not only by PPL degradation but also by accuracy on a list of benchmarks; a sketch of that comparison follows the results table below.

python tico/quantization/evaluation/script/llm_tasks_eval.py --model "HuggingFaceTB/SmolLM2-135M-Instruct"

Loading FP model …
`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|██████████| 2376/2376 [00:27<00:00, 85.22it/s]
Running loglikelihood requests: 100%|██████████| 9501/9501 [06:03<00:00, 26.17it/s]
results of HuggingFaceTB/SmolLM2-135M-Instruct evaluation:
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_easy|      1|none  |     0|acc     |↑  |0.5400|±  |0.0102|
|        |       |none  |     0|acc_norm|↑  |0.4882|±  |0.0103|
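
As a hedged sketch of the ranking idea above (the names fp_results / q_results and the "acc,none" metric key are assumptions about lm_eval's result layout, not part of this PR):

# Hedged sketch: per-task accuracy drop between an FP baseline and a
# quantized run. Both dicts stand in for the "results" mapping that
# simple_evaluate returns; the "acc,none" key follows lm_eval's
# "<metric>,<filter>" convention (an assumption here).
fp_results = {"arc_easy": {"acc,none": 0.5400}}
q_results = {"arc_easy": {"acc,none": 0.5100}}

for task, metrics in fp_results.items():
    drop = metrics["acc,none"] - q_results[task]["acc,none"]
    print(f"{task}: accuracy drop = {drop:.4f}")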

Draft: #436
TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

@stamalakhov stamalakhov requested review from a team and mhs4670go February 5, 2026 07:57
@stamalakhov stamalakhov self-assigned this Feb 5, 2026
@stamalakhov stamalakhov force-pushed the lm_eval branch 3 times, most recently from 35a94ad to 35b24da on February 5, 2026 08:51
) -> dict[str, Any]:
    model_to_evaluate = HFLM(model, "causal", tokenizer=tokenizer)
    tasks_list: list[str] = tasks.split(",")
    return evaluator.simple_evaluate(model_to_evaluate, tasks=tasks_list)["results"]
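
For context, a minimal sketch of how a function like this could be driven end to end, assuming lm_eval's HFLM wrapper and a Hugging Face model already loaded in memory (the name evaluate_tasks and the loading code are illustrative, not the PR's actual script):

from typing import Any

from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM
from transformers import AutoModelForCausalLM, AutoTokenizer

def evaluate_tasks(model, tokenizer, tasks: str) -> dict[str, Any]:
    # Wrap the in-memory HF model so lm_eval can query it.
    model_to_evaluate = HFLM(model, "causal", tokenizer=tokenizer)
    tasks_list: list[str] = tasks.split(",")
    return evaluator.simple_evaluate(model_to_evaluate, tasks=tasks_list)["results"]

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(evaluate_tasks(model, tokenizer, "arc_easy"))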
Contributor commented:

FYI, there is a visualization API.

import lm_eval
from lm_eval.utils import make_table

results = lm_eval.simple_evaluate(...)
print(make_table(results))
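
(Note, as far as lm_eval's utils go: make_table expects the full dict returned by simple_evaluate, reading its "results" key internally, so pass the whole return value rather than only the "results" sub-dict.)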

@stamalakhov (Author) replied:

Wow. Thank you! I didn't know that.

@stamalakhov (Author) replied:

@mhs4670go I'll update the script.

mhs4670go previously approved these changes on Feb 5, 2026

@mhs4670go (Contributor) left a comment:

LGTM

import argparse
from typing import Any

from lm_eval import evaluator
A reviewer commented:

Overall LGTM. But could you share your opinion, please: should lm_eval be added to the project's dependencies?

Contributor replied:

I don't think we would add a dependency on lm_eval. That is why this script lives in just the script folder and has no tests.

When we need external dependencies like transformers or lm_eval, it would be good to have dedicated scripts or workflows in the internal repo.
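
A common pattern for keeping such a dependency optional is a guarded import (a sketch, not code from this PR):

# Guarded import: fail with a clear message instead of a bare ImportError
# when the optional package is missing.
try:
    from lm_eval import evaluator
    from lm_eval.models.huggingface import HFLM
except ImportError as err:
    raise SystemExit(
        "This script requires the optional 'lm-eval' package. "
        "Install it with: pip install lm-eval"
    ) from err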

This PR introduces an option to run many LLM-related tasks using the `lm_eval` package.

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
@mhs4670go (Contributor) left a comment:

LGTM

@mhs4670go mhs4670go merged commit 3436e0f into Samsung:main Feb 9, 2026
7 checks passed
@stamalakhov stamalakhov deleted the lm_eval branch February 9, 2026 04:43